Cache optimization #2339
Conversation
Pull request overview
This PR introduces significant performance optimizations to the MultifrontalSolver, focusing on cache locality, memory efficiency, and parallel execution granularity. The changes implement QR factorization for leaf cliques, a fused load-and-eliminate path, cached factorization results, and improved memory management, delivering substantial speedups on larger datasets while introducing architectural improvements for precomputed symbolic data.
Key changes:
- Implements QR elimination mode for high aspect-ratio leaf cliques without prior Hessian factors
- Introduces a fused eliminateInPlace(graph) that interleaves loading and elimination in a single traversal (see the usage sketch after this list)
- Caches factorization results in a separate RSd_ matrix for optimized back-substitution
- Adds precomputed symbolic data support via a PrecomputedData struct and Precompute() method
- Increases the parallel task threshold from 10 to 4096 to reduce scheduling overhead for small cliques
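As a reference point, here is a minimal usage sketch of the new API surface. Only the names Precompute(), PrecomputedData, the precomputed-data constructor overload, and eliminateInPlace(graph) are taken from this PR's summary; headers, argument lists, and the final solve step are assumptions for illustration.

```cpp
// Hedged sketch, not the PR's exact code: only Precompute(), PrecomputedData,
// the precomputed-data constructor, and eliminateInPlace(graph) are named in
// this PR; argument lists and headers are assumptions.
#include <gtsam/inference/Ordering.h>
#include <gtsam/linear/GaussianFactorGraph.h>
#include <gtsam/linear/MultifrontalSolver.h>

void solveWithFusedPath(const gtsam::GaussianFactorGraph& graph,
                        const gtsam::Ordering& ordering) {
  using gtsam::MultifrontalSolver;

  // Run the symbolic analysis once; reuse it whenever the sparsity pattern
  // stays fixed, e.g. across Levenberg-Marquardt iterations.
  MultifrontalSolver::PrecomputedData symbolic =
      MultifrontalSolver::Precompute(graph, ordering);

  // New constructor overload that consumes the precomputed symbolic data.
  MultifrontalSolver solver(symbolic);

  // Fused load-and-eliminate: interleaves loading factor blocks and
  // eliminating cliques in a single traversal, instead of a separate
  // load(graph) followed by eliminateInPlace().
  solver.eliminateInPlace(graph);

  // Back-substitution then works off the cached R, S, d blocks (RSd_);
  // the retrieval call is omitted since its name is not given in the summary.
}
```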
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| gtsam/linear/MultifrontalSolver.h | Adds PrecomputedData struct, new constructor overload, Precompute() static method, eliminateInPlace(graph) overload, and state tracking flags |
| gtsam/linear/MultifrontalSolver.cpp | Refactors constructor to use precomputed data, implements fused elimination path, adds row counting for VBM sizing, updates parallel thresholds |
| gtsam/linear/MultifrontalClique.h | Adds QR mode, damping methods, RSd_ caching, prepareForElimination/factorize separation, IndexedSymbolicFactor row tracking |
| gtsam/linear/MultifrontalClique.cpp | Implements lazy SBM allocation, QR leaf factorization, damping support, cached factorization, separator-only SBM for QR updates |
| gtsam/base/SymmetricBlockMatrix.h | Adds utility methods for diagonal damping (addToDiagonalBlock, addScaledIdentity); see the sketch after this table |
| gtsam/linear/tests/testMultifrontalSolver.cpp | Updates tests to explicitly call load(), adds test for new eliminateInPlace(graph) API |
| timing/timeSFMBAL.h | Adds createOrderings helper function, minor formatting fixes |
| timing/timeMultifrontalSolver.cpp | Refactors benchmarks into separate functions, uses fused elimination path, adds BAL135 test |
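A minimal sketch of the SymmetricBlockMatrix damping helpers mentioned above; the method names addToDiagonalBlock and addScaledIdentity come from this PR, while the signatures shown are assumptions.

```cpp
// Hedged sketch: method names are from the PR summary, signatures assumed.
#include <Eigen/Core>
#include <gtsam/base/SymmetricBlockMatrix.h>

// Levenberg-Marquardt-style damping of a clique's Hessian: H <- H + lambda*I,
// applied across the whole block matrix in one call.
void dampUniform(gtsam::SymmetricBlockMatrix& info, double lambda) {
  info.addScaledIdentity(lambda);
}

// Per-variable damping: add an increment to a single diagonal block only.
void dampBlock(gtsam::SymmetricBlockMatrix& info, gtsam::DenseIndex block,
               const Eigen::MatrixXd& increment) {
  info.addToDiagonalBlock(block, increment);
}
```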
Linux Timing (TBB)
Similar conclusions: the chain improvement is massive, BAL less so, and speedups against GTSAM are less pronounced than on the Mac, but this might just be because the Linux box is a 20-core machine with 32 GB RAM and a nice cache.
Analysis
(After / Before figures omitted.)
Unfortunately, without TBB on Linux we get worse-than-GTSAM results for BAL datasets, which stumps me. The overall picture seems to be:
Chains:
BAL:
Some detailed profiling on Linux, single-threaded, might at least bring that on par.
@ProfFan the
PS: for chains, even single-threaded Linux is 4-6 times faster; it's really something about the massive fan-in for BAL cliques.
ProfFan left a comment:
I'll do more profiling with this PR merged
Overview
This PR introduces significant performance improvements and architectural updates to the MultifrontalSolver, focusing on cache locality, memory efficiency, and parallel granularity.
Key Optimizations:
QR Elimination for Leaves: leaf cliques with high aspect ratio and no prior Hessian factors are eliminated with QR instead of forming the Hessian (a conceptual Eigen sketch follows this list).
Cache Locality & "Fused" Path: eliminateInPlace(graph) interleaves loading and elimination in a single traversal.
Improved Memory Management: SBM allocation is lazy, and QR updates use separator-only SBMs.
Cached Factorization (RSd_): the solver now explicitly caches the elimination result for optimized back-substitution.
Parallel Task Tuning: the parallel task threshold is raised from 10 to 4096 to cut scheduling overhead on small cliques (a generic TBB sketch of this pattern follows the Architectural Changes list).
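To make the QR-for-leaves point concrete, here is a plain-Eigen illustration (not the PR's implementation) of the two elimination modes for a tall leaf Jacobian A: the Cholesky route squares A into an explicit Hessian first, while QR produces an equivalent upper-triangular factor (up to signs) directly from A, avoiding the squaring pass and keeping the conditioning of A itself.

```cpp
// Conceptual illustration only, not the PR's code. Assumes a tall A (rows >= cols).
#include <Eigen/Dense>

// Hessian route: form H = A^T A (an m x n -> n x n squaring pass),
// then take R from the Cholesky factorization H = R^T R.
Eigen::MatrixXd leafCholeskyR(const Eigen::MatrixXd& A) {
  Eigen::MatrixXd H = A.transpose() * A;
  Eigen::MatrixXd R = H.llt().matrixU();
  return R;
}

// QR route: R comes straight from A = Q R; no explicit Hessian is formed.
Eigen::MatrixXd leafQrR(const Eigen::MatrixXd& A) {
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);
  Eigen::MatrixXd R = qr.matrixQR()
                          .topRows(A.cols())
                          .triangularView<Eigen::Upper>();
  return R;
}
```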
Architectural Changes
Precomputed Symbolic Structure: a PrecomputedData struct and a static Precompute() method allow the symbolic analysis to be done once and reused across solves.
Damping Support: MultifrontalClique gains damping methods, backed by new SymmetricBlockMatrix utilities (addToDiagonalBlock, addScaledIdentity).
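For the parallel task tuning item above, the threshold change can be pictured with a generic TBB gating pattern; this is not the PR's code, and exactly which quantity the solver compares against 4096 is an assumption here.

```cpp
// Generic TBB pattern, not the PR's implementation: small work units run
// inline, and only work above the threshold is spawned as a task. Raising the
// threshold from 10 to 4096 trades parallelism on tiny cliques for much
// lower scheduling overhead.
#include <cstddef>
#include <tbb/task_group.h>

constexpr std::size_t kParallelThreshold = 4096;  // previously 10

template <typename Clique, typename Work>
void processClique(const Clique& clique, Work&& work, tbb::task_group& group) {
  if (clique.size() < kParallelThreshold) {
    work(clique);                                   // run inline, no task
  } else {
    group.run([&clique, &work] { work(clique); });  // spawn a TBB task
  }
}
```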
Timing and Analysis
Below are the timings on a Mac, with TBB enabled (M1 MacBook with only 16 GB RAM).
The optimizations significantly improve scalability, at the cost of a small regression on very small problems.
Chain benchmarks
BAL benchmarks
Timings after the MultifrontalSolver optimizations, and before (After / Before figures omitted).